Abstract. A fundamental result of statistical learning theory
states that a concept class is PAC learnable if and only if it is a
uniform Glivenko-Cantelli class if and only if the VC dimension
of the class is finite. However, the theorem is only valid under
special assumptions of measurability of the class, in which
case the PAC learnability even becomes consistent. Otherwise,
there is a classical example, constructed under the Continuum
Hypothesis by Dudley and Durst and further adapted by
Blumer, Ehrenfeucht, Haussler, and Warmuth, of a concept
class of VC dimension one which is neither uniform
Glivenko-Cantelli nor consistently PAC learnable. We show that, rather
surprisingly, under an additional set-theoretic hypothesis which
is much milder than the Continuum Hypothesis (Martin's
Axiom), PAC learnability is equivalent to finite VC dimension
for every concept class.

I. Introduction 

The following is a fundamental result of statistical learning 
theory. 

Theorem 1: For a concept class $\mathscr{C}$ the following three
conditions are equivalent:

1) $\mathscr{C}$ is distribution-free PAC learnable,

2) $\mathscr{C}$ is a uniform Glivenko-Cantelli class, and

3) the Vapnik-Chervonenkis dimension of $\mathscr{C}$ is finite.

It is in this form that the theorem is usually stated in
textbooks on the subject, see [1], [2]. Condition 1) means
the existence of a learning rule for $\mathscr{C}$ which is probably
approximately correct.

However, strictly speaking, the result is only true under
a suitable measurability assumption on the concept class $\mathscr{C}$.
One such assumption is that of being image admissible
Souslin: the class $\mathscr{C}$ can be parametrized with elements of
the unit interval so that the pairs $(x,t)$ with $x \in C_t$, $t \in [0,1]$,
form an analytic subset of $\Omega \times [0,1]$ [3]. Another measurability
assumption, more difficult to state, is that of a well-behaved
class [2]. Under either of those conditions, statement
1) in Theorem 1 can be replaced with

1') $\mathscr{C}$ is distribution-free consistently PAC learnable,

meaning that every learning rule $\mathcal{L}$ consistent with $\mathscr{C}$
is distribution-free probably approximately correct. In the
proof, a measurability hypothesis on $\mathscr{C}$ has to be invoked
twice, in order to deduce the implications (3) $\Rightarrow$ (2) and
(1') $\Rightarrow$ (1).

In particular, Theorem 1 holds for every countable class $\mathscr{C}$
or, more generally, for every universally separable class [4].
It is arguable that every concept class emerging in either the
theory or applications of statistical learning will be measurable
in a sufficiently strong sense. For this reason, a measurability
condition on $\mathscr{C}$ is typically not even mentioned.

The fact remains that Theorem 1 cannot be derived in
full generality. An example of a concept class $\mathscr{C}$ of
Vapnik-Chervonenkis (VC) dimension one which is not uniform
Glivenko-Cantelli was constructed by Durst and Dudley [5],
and a further modification of this example, also of VC
dimension one, fails consistent PAC learnability [2].

This example has been constructed under the Continuum
Hypothesis (CH), which is arguably not a natural assumption
in a probabilistic context [6]. However, the example remains
valid under a much more relaxed and natural set-theoretic
hypothesis: Martin's Axiom (MA). It is one of the most often
used and best studied additional set-theoretic assumptions
beyond the standard Zermelo-Fraenkel set theory with the
Axiom of Choice (ZFC). In particular, Martin's Axiom
follows from the Continuum Hypothesis (CH), but it is also
compatible with the negation of CH, and in fact it is precisely
the combination MA + $\neg$CH that is really interesting [7], [8],
[9].

In this note we make the following, somewhat astonishing,
observation: under the same assumption (Martin's Axiom),
conditions (1) and (3) in Theorem 1 are equivalent. Here
is our main result.

Theorem 2: Assume the validity of Martin's Axiom (MA).
Then the following are equivalent for every concept class
$\mathscr{C}$ consisting of universally measurable subsets of a Borel
domain $\Omega$:

1) $\mathscr{C}$ is distribution-free PAC learnable, and

2) the Vapnik-Chervonenkis dimension of $\mathscr{C}$ is finite.

Of course it is only the implication (2) $\Rightarrow$ (1) that needs
proving, because (1) $\Rightarrow$ (2) is a well-known classical result
from [2] which does not require any assumptions on $\mathscr{C}$.

We review a precise formal setting for learnability, after
which we proceed to the analysis of a counter-example from [5],
[2]. We observe that the concept class $\mathscr{C}$ in the example is
in fact PAC learnable, and this observation provides a clue
to a general result.

The construction of the learning rule $\mathcal{L}$ can be described
as a "first in, first served" approach. The concept class $\mathscr{C}$
is given a minimal well-ordering, $\prec$, and $\mathcal{L}$ is constructed
recursively, by assigning to a learning sample the $\prec$-smallest
concept $C$ consistent with it. As a
consequence, for every concept $C \in \mathscr{C}$, the image of all
learning samples of the form $(\sigma, C \cap \sigma)$ under $\mathcal{L}$ forms
a uniform Glivenko-Cantelli class. It is for establishing
this property of $\mathcal{L}$ that we need Martin's Axiom. Now the
probable approximate correctness of $\mathcal{L}$ is straightforward.

The present approach goes back to the present author's earlier
work [10], but the results are new and have never been stated
explicitly before.

II. The setting 

For obvious reasons, we need to be quite precise when
fixing a general setting for learnability. The domain (instance
space) $\Omega = (\Omega, \mathscr{A})$ is a standard Borel space, that is, a
complete separable metric space equipped with the sigma-algebra
of Borel subsets (the smallest family of sets containing
all open balls and closed under complements and
countable intersections).

Measures on $\Omega$ mean Borel probability measures, that is,
countably additive functions $\mu$ on $\mathscr{A}$ with values in the unit
interval $[0, 1]$, having the property $\mu(\Omega) = 1$. We will not
distinguish between a measure $\mu$ and its Lebesgue completion,
that is, an extension of $\mu$ over a larger sigma-algebra of
Lebesgue $\mu$-measurable subsets of $\Omega$. Furthermore, recall that
a subset $A \subseteq \Omega$ is universally measurable if it is Lebesgue
$\mu$-measurable for every probability measure $\mu$ on $\Omega$.

With this caveat, a concept class, $\mathscr{C}$, is a family of
universally measurable subsets of $\Omega$.

In the learning model, a set $\mathcal{P}$ of probability measures on $\Omega$
is fixed. Usually either $\mathcal{P} = P(\Omega)$ is the set of all probability
measures (distribution-free learning), or $\mathcal{P} = \{\mu\}$ is a single
measure (learning under fixed distribution). In our article, the
case of interest is the former, although some of our results
are valid in the case of a general family $\mathcal{P} \subseteq P(\Omega)$.

A learning sample is a pair $(\sigma, \tau)$ of finite subsets of $\Omega$,
where $\tau \subseteq \sigma$ is thought of as the set of points belonging to
an unknown concept, $C$. The set of all samples of size $n$ is
usually identified with $(\Omega \times \{0, 1\})^n$.

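A learning rule (for $\mathscr{C}$) is a mapping

$$\mathcal{L}\colon \bigcup_{n=1}^{\infty} \Omega^n \times \{0,1\}^n \to \mathscr{C}$$

which satisfies the following measurability condition: for
every $C \in \mathscr{C}$, $n \in \mathbb{N}$ and $\mu \in \mathcal{P}$, the function

$$\Omega^n \ni \sigma \mapsto \mu\left(\mathcal{L}(\sigma, C \cap \sigma) \,\triangle\, C\right) \in \mathbb{R} \tag{1}$$

is measurable.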

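A learning rule $\mathcal{L}$ is consistent (with a concept class $\mathscr{C}$)
if for all $C \in \mathscr{C}$, $n \in \mathbb{N}$ and $\sigma \in \Omega^n$ one has

$$\mathcal{L}(\sigma, C \cap \sigma) \cap \sigma = C \cap \sigma.$$

A learning rule $\mathcal{L}$ is probably approximately correct (PAC)
under $\mathcal{P}$ if for every $\epsilon > 0$

$$\mu^{\otimes n}\left\{\sigma \in \Omega^n : \mu\left(\mathcal{L}(\sigma, C \cap \sigma) \,\triangle\, C\right) > \epsilon\right\} \to 0 \tag{2}$$

as $n \to \infty$, uniformly over all $C \in \mathscr{C}$ and $\mu \in \mathcal{P}$. Here
$\mu^{\otimes n}$ denotes the product measure on $\Omega^n$.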



In terms of a sample complexity function $s(\epsilon, \delta)$, a learning
rule $\mathcal{L}$ is PAC if for each $C \in \mathscr{C}$ and every $\mu \in \mathcal{P}$
an independent identically distributed (i.i.d.) sample $\sigma =
(x_1, x_2, \ldots, x_n)$ with $n > s(\epsilon, \delta)$ points has the property
$\mu(C \,\triangle\, \mathcal{L}(\sigma, C \cap \sigma)) < \epsilon$ with confidence $> 1 - \delta$.

A concept class $\mathscr{C}$ is PAC learnable under $\mathcal{P}$ if there
exists a PAC learning rule for $\mathscr{C}$ under $\mathcal{P}$. A class $\mathscr{C}$
is consistently learnable (under $\mathcal{P}$) if every learning rule
consistent with $\mathscr{C}$ is PAC under $\mathcal{P}$. If $\mathcal{P} = P(\Omega)$ is
the set of all probability measures, then $\mathscr{C}$ is said to be
distribution-free PAC learnable. If $\mathcal{P} = \{\mu\}$ is a single
probability measure, one speaks of learning under a single
distribution. Learnability under intermediate families $\mathcal{P}$ is
also receiving considerable attention, cf. Chapter 7 in [11].
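
The definitions above can be illustrated by a small simulation. This is only a sketch and is not taken from the paper: it assumes the domain $[0,1]$ with the uniform measure, the (perfectly measurable) class of segments $[0,t]$, and a hypothetical consistent rule `learn_threshold` that returns the smallest consistent segment. All function names and parameter values are illustrative assumptions.

```python
import random

def learn_threshold(sample, labels):
    """Hypothetical consistent rule for concepts [0, t]: return the
    smallest threshold whose segment contains every positive point."""
    positives = [x for x, y in zip(sample, labels) if y == 1]
    # With no positive points, the empty concept is consistent.
    return max(positives) if positives else float("-inf")

def error(t_true, t_hat):
    """Uniform measure of the symmetric difference of [0, t_true]
    and [0, t_hat] (t_hat = -inf stands for the empty concept)."""
    return abs(t_true - max(t_hat, 0.0))

def failure_rate(t_true, n, eps, trials, rng):
    """Fraction of i.i.d. n-samples on which the learned concept
    has error greater than eps (an estimate of the left side of Eq. (2))."""
    failures = 0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        labels = [1 if x <= t_true else 0 for x in sample]
        if error(t_true, learn_threshold(sample, labels)) > eps:
            failures += 1
    return failures / trials

rng = random.Random(0)  # fixed seed so the run is reproducible
small_n = failure_rate(t_true=0.7, n=5, eps=0.1, trials=2000, rng=rng)
large_n = failure_rate(t_true=0.7, n=200, eps=0.1, trials=2000, rng=rng)
print(small_n, large_n)
```

In this toy case the failure probability equals $(1-\epsilon)^n$, so the estimate for $n = 200$ should be essentially zero while the one for $n = 5$ stays large.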

Notice that in this paper, we only talk of potential PAC
learnability, adopting a purely information-theoretic viewpoint.
As a consequence, our statements about learning rules
are existential rather than constructive, and building learning
rules by transfinite recursion is perfectly acceptable.

A concept class $\mathscr{C}$ is uniform Glivenko-Cantelli with
regard to a family of measures $\mathcal{P}$ if for each $\epsilon > 0$

$$\sup_{\mu \in \mathcal{P}} \mu^{\otimes n}\left\{\sup_{C \in \mathscr{C}} \left|\mu(C) - \mu_n(C)\right| > \epsilon\right\} \to 0 \quad \text{as } n \to \infty. \tag{3}$$

Here $\mu_n$ stands for the empirical (uniform) measure on $n$
points, sampled in an i.i.d. fashion from $\Omega$ according to
the distribution $\mu$. In this case, one also says that $\mathscr{C}$ has
the property of uniform convergence of empirical measures
(UCEM property) (with regard to $\mathcal{P}$) [11].

Every uniform Glivenko-Cantelli concept class (with regard
to $\mathcal{P}$) is consistently PAC learnable (under $\mathcal{P}$), as is
easy to verify. In the distribution-free situation ($\mathcal{P} = P(\Omega)$)
the converse holds under additional measurability conditions
on the class mentioned in the Introduction, but, as we will
see, not always.
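
For a measurable toy class, the uniform convergence in Eq. (3) can be observed numerically. The sketch below is an illustration under stated assumptions, not part of the paper: the domain is $[0,1]$ with the uniform measure, the class consists of the segments $[0,t]$, and the helper name `sup_deviation` is hypothetical. For this class the supremum of $|\mu(C) - \mu_n(C)|$ is the Kolmogorov-Smirnov statistic, which can be computed exactly from the sorted sample.

```python
import random

def sup_deviation(sample):
    """Exact supremum over t in [0, 1] of |t - (fraction of points <= t)|,
    that is, the sup over segments [0, t] of |mu(C) - mu_n(C)| for the
    uniform measure mu."""
    xs = sorted(sample)
    n = len(xs)
    # The empirical distribution function jumps from i/n to (i+1)/n at xs[i],
    # so the largest gap is attained at, or just before, a sample point.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(1)  # fixed seed so the run is reproducible
dev_small = sup_deviation([rng.random() for _ in range(20)])
dev_large = sup_deviation([rng.random() for _ in range(5000)])
print(dev_small, dev_large)
```

The deviation for the larger sample is typically much smaller, in line with the uniform Glivenko-Cantelli property of this particular class. The examples of the next section show that such behaviour can fail once measurability assumptions are dropped.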

More precisely, every distribution-free PAC learnable class
has finite VC dimension (this was proved in [2], Theorem
2.1(i); see also e.g. [11], Lemma 7.2 on p. 279). Now the
measurability conditions on $\mathscr{C}$ assure that a class $\mathscr{C}$ of finite
VC dimension $d$ is uniform Glivenko-Cantelli, with a sample
complexity bound that does not depend on $\mathscr{C}$, but only on
$\epsilon$, $\delta$, and $d$. The following is a typical (and far from being
optimal) such estimate, which can be deduced, for instance,
along the lines of [12]:

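$$s(\epsilon,\delta,d) \leq \frac{c_1}{\epsilon^2}\left(d\log\left(\frac{c_2}{\epsilon}\log\frac{c_3}{\epsilon}\right) + \log\frac{c_4}{\delta}\right), \tag{4}$$

where $c_1, \ldots, c_4$ are absolute constants.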

For our purposes, we will fix any such bound and refer to it
as a "standard" sample complexity estimate for $s(\epsilon,\delta,d)$.

Now the consistent learnability of $\mathscr{C}$, with the same
sample complexity, follows. Of course in order to conclude
that $\mathscr{C}$ is PAC learnable, it is necessary to prove the existence
of a consistent learning rule satisfying Eq. (1). This is usually
done by means of subtle measurable selection theorems, using
the same measurability assumptions on $\mathscr{C}$ yet again.

Finally, recall that a subset $N \subseteq \Omega$ is universal null if for
every non-atomic probability measure $\mu$ on $\Omega$ one has
$\mu(N') = 0$ for some Borel set $N'$ containing $N$. Universal
null Borel sets are just countable sets.

III. Revisiting an example of Durst and Dudley 

The proof of the implication (3) $\Rightarrow$ (2) in Theorem 1
depends in an essential way on the Fubini theorem, which
is why some measurability restrictions on the class $\mathscr{C}$ are
unavoidable. Without them, the conclusion is not true in
general. Here is a classical example of a concept class having
finite VC dimension which is not uniform Glivenko-Cantelli.

Example 3 (Durst and Dudley [5], Proposition 2.2):
Assume the validity of the Continuum Hypothesis (CH). Let
$\Omega$ be an uncountable standard Borel space, that is, up to an
isomorphism, a Borel space associated to the unit interval
$[0, 1]$. The statement of CH is equivalent to the existence of
a total order $\prec$ on $\Omega$ with the property that every half-open
initial segment $I_y = \{x \in \Omega : x \prec y\}$, $y \in \Omega$, is countable,
and $\prec$ is a well-ordering: every non-empty subset of $\Omega$ has
a smallest element. Fix such an order.

Let $\mathscr{C}$ consist of all half-open initial segments $I_y$, $y \in \Omega$,
as above. Clearly, the VC dimension of the class $\mathscr{C}$ is one.

Now let $\mu$ be a non-atomic Borel probability measure on
$\Omega$ (e.g., the Lebesgue measure on $[0, 1]$). Under CH, every
element of $\mathscr{C}$ is a countable set, therefore Borel measurable
of measure zero. At the same time, for every $n$ and each i.i.d.
random $n$-sample $\sigma$, there is a countable initial segment $C =
I_y \in \mathscr{C}$ containing all elements of $\sigma$. The empirical measure
of $C$ with regard to $\sigma$ is one. Thus, no finite sample guesses
the measure of all elements of $\mathscr{C}$ to within an accuracy $\epsilon < 1$
with a non-vanishing confidence.

See also [13], p. 314; [3], pp. 170-171.
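
The claim that initial segments of a total order have VC dimension one can be checked by brute force on a finite stand-in. The following is a minimal sketch, assuming a six-point domain with its usual order in place of the well-ordered continuum; the helper names `is_shattered` and `vc_dimension` are hypothetical, and the finite model says nothing about the measure-theoretic part of the example.

```python
from itertools import combinations

def is_shattered(points, concepts):
    """True if every subset of `points` is the trace of some concept."""
    traces = {frozenset(c & points) for c in concepts}
    return len(traces) == 2 ** len(points)

def vc_dimension(domain, concepts):
    """Largest size of a subset of `domain` shattered by a non-empty
    family `concepts` (brute force, for small finite examples only)."""
    best = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(frozenset(s), concepts)
               for s in combinations(domain, k)):
            best = k
        else:
            break  # subsets of shattered sets are shattered, so stop here
    return best

domain = list(range(6))
# Half-open initial segments I_y = {x : x < y} of the usual order on 0..5.
segments = [frozenset(range(y)) for y in domain]

print(vc_dimension(domain, segments))
# Adding the whole domain as an extra concept, as Example 4 does.
print(vc_dimension(domain, segments + [frozenset(domain)]))
```

No two-point set $\{a, b\}$ with $a \prec b$ can be shattered, because an initial segment containing $b$ must also contain $a$.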

A further modification of this construction gives an example
of a concept class of finite VC dimension which is not
consistently PAC learnable.

Example 4 (Blumer et al. [2], p. 953): Again, assume
CH. Add to the concept class $\mathscr{C}$ from Example 3 the set $\Omega$
as an element, forming a new concept class $\mathscr{C}' = \mathscr{C} \cup \{\Omega\}$.
One still has $\mathrm{VC}(\mathscr{C}') = 1$. For a finite labelled sample
$(\sigma, \tau)$ define

$$\mathcal{L}(\sigma,\tau) = I_z, \quad z = \min\{y \in (\Omega, \prec) : \tau \subseteq I_y\}. \tag{5}$$

The learning rule $\mathcal{L}$ is consistent with the class $\mathscr{C}'$. At
the same time, $\mathcal{L}$ is not probably approximately correct.
Indeed, for the concept $C = \Omega$ the learning
rule $\mathcal{L}(\sigma, \Omega \cap \sigma) = \mathcal{L}(\sigma,\sigma)$ will always return a countable
concept $I_y$ for some $y \in \Omega$, and if $\mu$ is a non-atomic Borel
probability measure on $\Omega$, then $\mu(C \,\triangle\, I_y) = 1$. The concept
$C = \Omega$ cannot be learned to accuracy $\epsilon < 1$ with a non-zero
confidence.

Remark 5: It is important to note that, again under CH,
the class $\mathscr{C}'$ is distribution-free PAC learnable.

Indeed, redefine a well-ordering on $\mathscr{C}' = \{I_x : x \in
\Omega\} \cup \{\Omega\}$ by making $\Omega$ the smallest element (instead of
the largest one) and keeping the order relation between other
elements the same. Denote the new order relation by $\prec_1$, and
define a learning rule $\mathcal{L}_1$ similarly to Eq. (5), but this time
understanding the minimum with regard to the well-ordering $\prec_1$:

$$\mathcal{L}_1(\sigma,\tau) = \min_{\prec_1}\left\{C \in \mathscr{C}' : C \cap \sigma = \tau\right\}. \tag{6}$$

In essence, $\mathcal{L}_1$ examines all the concepts following a transfinite
order on them, and returns the first encountered concept
consistent with the sample, provided it exists.

To see what difference it makes with Example 4, let $\mu$ be
again a non-atomic probability measure on $\Omega$. If $C = \Omega$,
then for every sample $\sigma$ consistently labelled with $C$ the
rule $\mathcal{L}_1$ will return $C$, because this is the smallest consistent
concept encountered by the algorithm. If $C \neq \Omega$, then for
$\mu$-almost all samples $\sigma$ the labelling on $\sigma$ produced by $C$ will
be empty, and the concept $\mathcal{L}_1(\sigma, \emptyset)$ returned by $\mathcal{L}_1$, while
possibly different from $C$, will be again a countable concept,
meaning that $\mu(C \,\triangle\, \mathcal{L}_1(\sigma, \emptyset)) = 0$.
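
The mechanics of the two rules can be compared on a finite toy model. This is a sketch under loud assumptions: a ten-point domain with its usual order stands in for the well-ordered continuum, so "countable" and "measure zero" have no counterpart here, and the function name `first_consistent` is hypothetical. Only the effect of where $\Omega$ sits in the ordering is illustrated.

```python
def first_consistent(ordered_concepts, sample, positives):
    """Return the first concept, in the given order, whose trace on the
    sample equals the set of positively labelled points (or None)."""
    sample, positives = set(sample), set(positives)
    for concept in ordered_concepts:
        if concept & sample == positives:
            return concept
    return None

domain = range(10)
segments = [frozenset(range(y)) for y in domain]  # I_y = {x : x < y}
whole = frozenset(domain)                         # plays the role of Omega

omega_last = segments + [whole]    # Omega is the largest element, as in Eq. (5)
omega_first = [whole] + segments   # Omega is the smallest element, as in Eq. (6)

sample = [2, 5, 7]
# Target concept Omega: every sample point is labelled positive.
print(sorted(first_consistent(omega_last, sample, sample)))   # a proper segment
print(sorted(first_consistent(omega_first, sample, sample)))  # the whole domain
# Target concept I_3 seen through a sample that misses it: empty labelling.
print(sorted(first_consistent(omega_first, [5, 7, 9], [])))   # the empty segment
```

With $\Omega$ placed last, the rule never returns $\Omega$ as long as some initial segment covers the sample, which is the failure described in Example 4. With $\Omega$ placed first, it is returned whenever it is consistent.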

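To give a formal proof that $\mathcal{L}_1$ is PAC, notice that for every
$C \in \mathscr{C}'$ and each $n \in \mathbb{N}$ the collection of pairwise distinct
concepts $\mathcal{L}_1(\sigma \cap C)$, $\sigma \in \Omega^n$, is only countable (under CH),
because they are all contained in the $\prec_1$-initial segment of a
minimally ordered set $\mathscr{C}'$ of cardinality continuum, bounded
by $C$ itself. As a consequence, the concept class

$$\mathcal{L}_1^C = \left\{\mathcal{L}_1(\sigma \cap C) : \sigma \in \Omega^n,\ n \in \mathbb{N}\right\} \subseteq \mathscr{C}' \tag{7}$$

is also countable (assuming CH). The VC dimension of the
family $\mathcal{L}_1^C \cup \{C\}$ is $\leq 1$, and being countable, it is a uniform
Glivenko-Cantelli class with a standard sample complexity
as in Eq. (4). Consequently, given $\epsilon, \delta > 0$, and assuming that
$n$ is sufficiently large, one has for each probability measure
$\mu$ on $\Omega$ and a random sample $\sigma \in \Omega^n$, with confidence $> 1 - \delta$,

$$\mu\left(C \,\triangle\, \mathcal{L}_1(\sigma, C \cap \sigma)\right) < \epsilon$$

provided $n > s(\epsilon, \delta, 1)$, as required.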

Remark 6: Notice that the role of the Continuum Hypothesis
in the above examples was merely to assure that every
initial segment $I_y$, $y \in \Omega$, is a universally measurable set.
As we will see, this can be achieved under the much milder
assumption of Martin's Axiom.

Remark 7: Thus, under the Continuum Hypothesis, the
example of Dudley and Durst as modified by Blumer,
Ehrenfeucht, Haussler, and Warmuth gives an example of a
PAC learnable concept class which is not uniform
Glivenko-Cantelli (even though it has finite VC dimension). As will
become clear in the next Section, the assumption of CH can
be weakened to Martin's Axiom. Still, it would be interesting
to know whether an example with the same combination of
properties can be constructed without additional set-theoretic
assumptions.

A basic observation of this Section is that in order for
a learning rule $\mathcal{L}$ to be PAC, the assumption on $\mathscr{C}$ being
uniform Glivenko-Cantelli can be weakened as follows.

Lemma 8: Let $\mathscr{C}$ be a concept class and $\mathcal{P}$ a family of
probability measures on the domain $\Omega$. Suppose there exists a
function $s(\epsilon, \delta)$ and a learning rule $\mathcal{L}$ for $\mathscr{C}$ with the property
that for every $C \in \mathscr{C}$, the set $\mathcal{L}^C \cup \{C\}$ is Glivenko-Cantelli
with regard to $\mathcal{P}$ with the sample complexity $s(\epsilon, \delta)$, where

$$\mathcal{L}^C = \left\{\mathcal{L}(C \cap \sigma) : \sigma \in \Omega^n,\ n \in \mathbb{N}\right\}.$$

Then $\mathcal{L}$ is probably approximately correct under $\mathcal{P}$ with
sample complexity $s(\epsilon, \delta)$. ■

This simple fact becomes useful in combination with
the technique of well-orderings. Of course the Continuum
Hypothesis is a particularly unnatural assumption in a
probabilistic context (cf. [6]). But it is unnecessary. Martin's
Axiom (MA) is a much weaker and more natural additional
set-theoretic axiom, which works just as well.

IV. Learnability under Martin's Axiom 

Martin's Axiom (MA) says that no compact Hausdorff 
topological space with the countable chain condition is a 
union of strictly less than continuum nowhere dense subsets. 
Thus, it is a stronger statement than the Baire Category 
Theorem. In particular, the Continuum Hypothesis implies 
MA. However, MA is compatible with the negation of CH, 
and this is where the most interesting applications of MA 
are to be found. We need the following consequence of MA. 

Theorem 9 (Martin-Solovay): Let $(\Omega, \mu)$ be a standard
Lebesgue non-atomic probability space. Under MA, the
Lebesgue measure is $2^{\aleph_0}$-additive, that is, if $\kappa < 2^{\aleph_0}$ and
$A_\alpha$, $\alpha < \kappa$, is a family of pairwise disjoint measurable sets,
then $\bigcup_{\alpha<\kappa} A_\alpha$ is Lebesgue measurable and

$$\mu\left(\bigcup_{\alpha<\kappa} A_\alpha\right) = \sum_{\alpha<\kappa} \mu(A_\alpha).$$

In particular, the union of strictly less than continuum null
subsets of $\Omega$ is a null subset. ■

For the proof and more on MA, see [9], Theorem 2.21, or
[?], or [?], pp. 563-565.

Lemma 10: Let $\mathscr{C}$ be a concept class and $\mathcal{P}$ a family of
probability measures on a standard Borel domain $\Omega$. Consider
the following properties.

1) Every countable subclass of $\mathscr{C}$ is uniform
Glivenko-Cantelli with regard to $\mathcal{P}$.

2) There is a function $s(\epsilon, \delta)$ so that every countable
subclass of $\mathscr{C}$ is uniform Glivenko-Cantelli with regard
to $\mathcal{P}$ with sample complexity $s(\epsilon, \delta)$.

3) Every subclass $\mathscr{C}'$ of $\mathscr{C}$ having cardinality $< 2^{\aleph_0}$ is
uniform Glivenko-Cantelli with regard to $\mathcal{P}$.

4) There is a function $s(\epsilon, \delta)$ so that every subclass $\mathscr{C}'$
of $\mathscr{C}$ having cardinality $< 2^{\aleph_0}$ is uniform
Glivenko-Cantelli with regard to $\mathcal{P}$ with sample complexity
$s(\epsilon,\delta)$.

Then (4) $\Rightarrow$ (3) $\Rightarrow$ (1) $\Leftrightarrow$ (2).
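Under Martin's Axiom, all four conditions are equivalent.

Proof: The implications (2) $\Rightarrow$ (1), (4) $\Rightarrow$ (3), (3) $\Rightarrow$ (1)
and (4) $\Rightarrow$ (2) are trivially true. To show (1) $\Rightarrow$ (2), let
$\delta, \epsilon > 0$ be arbitrary but fixed. For each countable subclass
$\mathscr{C}'$, choose the smallest value of sample complexity $s =
s(\mathscr{C}', \epsilon, \delta)$. The function $\mathscr{C}' \mapsto s(\mathscr{C}', \epsilon, \delta)$ is monotone under
inclusions: if $\mathscr{C}' \subseteq \mathscr{C}''$ then $s(\mathscr{C}', \epsilon, \delta) \leq s(\mathscr{C}'', \epsilon, \delta)$. If $\mathscr{C}'_n$
is a sequence of countable classes, then the union $\bigcup_{n=1}^{\infty} \mathscr{C}'_n$
is a countable class, whose sample complexity value bounds
from above $s(\mathscr{C}'_n, \epsilon, \delta)$, $n = 1, 2, \ldots$. Thus, the function
$\mathscr{C}' \mapsto s(\mathscr{C}', \epsilon, \delta)$ for $\delta, \epsilon > 0$ fixed is bounded on countable
sets of inputs, and therefore bounded.

Now assume (MA). It is enough to prove (2) $\Rightarrow$ (4).
This is done by a transfinite induction on the cardinality
$\kappa = |\mathscr{C}'|$, which never exceeds $2^{\aleph_0}$ because $\mathscr{C}'$ consists
of Borel subsets of a standard Borel domain. For $\kappa = \aleph_0$
there is nothing to prove. Else, represent $\mathscr{C}'$ as a union of an
increasing transfinite chain of concept classes $\mathscr{C}_\alpha$, $\alpha < \kappa$, for
each of which the statement of (4) holds. For every $\epsilon > 0$
and $n \in \mathbb{N}$, the set

$$\left\{\sigma \in \Omega^n : \sup_{C \in \mathscr{C}'} \left|\mu_n(\sigma)(C) - \mu(C)\right| < \epsilon\right\}
= \bigcap_{\alpha<\kappa} \left\{\sigma \in \Omega^n : \sup_{C \in \mathscr{C}_\alpha} \left|\mu_n(\sigma)(C) - \mu(C)\right| < \epsilon\right\}$$

is measurable by Martin-Solovay's Theorem 9. Given $\delta > 0$
and $n > s(\epsilon, \delta, d)$, another application of the same result
leads to conclude that for every $\mu \in P(\Omega)$:

$$\begin{aligned}
\mu^{\otimes n}\left\{\sigma \in \Omega^n : \sup_{C \in \mathscr{C}'} \left|\mu_n(\sigma)(C) - \mu(C)\right| < \epsilon\right\}
&= \mu^{\otimes n}\left(\bigcap_{\alpha<\kappa} \left\{\sigma \in \Omega^n : \sup_{C \in \mathscr{C}_\alpha} \left|\mu_n(\sigma)(C) - \mu(C)\right| < \epsilon\right\}\right) \\
&= \inf_{\alpha<\kappa} \mu^{\otimes n}\left\{\sigma \in \Omega^n : \sup_{C \in \mathscr{C}_\alpha} \left|\mu_n(\sigma)(C) - \mu(C)\right| < \epsilon\right\} \\
&> 1 - \delta,
\end{aligned}$$

as required. ■

Lemma 11: Let $\mathscr{C}$ be a concept class whose countable
subclasses are uniform Glivenko-Cantelli with regard to a
family of probability measures $\mathcal{P}$. Let $\mathcal{L}$ be a learning rule
for $\mathscr{C}$ with the property that for every $C \in \mathscr{C}$ and each $n \in \mathbb{N}$, the set

$$\mathcal{L}^{C,n} = \left\{\mathcal{L}(C \cap \sigma) : \sigma \in \Omega^n\right\} \tag{8}$$

has cardinality strictly less than continuum. Under Martin's
Axiom, the rule $\mathcal{L}$ is probably approximately correct under
$\mathcal{P}$. The common sample complexity bound of countable
subclasses of $\mathscr{C}$ becomes the sample complexity bound for
the learning rule $\mathcal{L}$.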

Proof: Recall that $2^{\aleph_0}$ is a regular cardinal, and thus
admits no countable cofinal subset. Therefore, under the
assumptions of the Lemma, the cardinality of $\mathcal{L}^C = \bigcup_{n=1}^{\infty} \mathcal{L}^{C,n}$
is still strictly less than continuum. Applying now Lemma
10 and then Lemma 8, we conclude. ■

The following result establishes existence of learning rules 
with the required property. 

Lemma 12: Let $\mathscr{C}$ be an infinite concept class on a
measurable space $\Omega$. Denote by $\kappa = |\mathscr{C}|$ the cardinality of
$\mathscr{C}$. There exists a consistent learning rule $\mathcal{L}$ for $\mathscr{C}$ with the
property that for every $C \in \mathscr{C}$ and each $n$, the set $\mathcal{L}^{C,n}$ (cf.
Eq. (8)) has cardinality $< \kappa$. Under MA the rule $\mathcal{L}$ satisfies
the condition in Eq. (1).

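Proof: Choose a minimal well-ordering of the elements of $\mathscr{C}$:

$$\mathscr{C} = \{C_\alpha : \alpha < \kappa\},$$

and set for every $\sigma \in \Omega^n$ and $\tau \in \{0, 1\}^n$ the value $\mathcal{L}(\sigma, \tau)$
equal to $C_\beta$, where

$$\beta = \min\{\alpha < \kappa : C_\alpha \cap \sigma = \tau\},$$

provided such a $\beta$ exists. Clearly, for each $\alpha < \kappa$ one has

$$\mathcal{L}(\sigma, C_\alpha \cap \sigma) \in \{C_\beta : \beta \leq \alpha\},$$

which assures (8). Besides, the learning rule $\mathcal{L}$ is consistent.

Fix $C = C_\alpha \in \mathscr{C}$, $\alpha < \kappa$. For every $\beta \leq \alpha$ define $D_\beta =
\{\sigma \in \Omega^n : C \cap \sigma = C_\beta \cap \sigma\}$. The sets $D_\beta$ are measurable,
and the function

$$\Omega^n \ni \sigma \mapsto \mu\left(\mathcal{L}(C \cap \sigma) \,\triangle\, C\right) \in \mathbb{R}$$

takes a constant value $\mu(C \,\triangle\, C_\beta)$ on each set $D_\beta \setminus \bigcup_{\gamma<\beta} D_\gamma$,
$\beta \leq \alpha$. Such sets, as well as all their possible unions,
are measurable under MA by force of Martin-Solovay's
Theorem 9, and their union is $\Omega^n$. This implies the validity
of Eq. (1) for $\mathcal{L}$. ■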
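
The key property of the rule constructed in the proof, namely that a sample labelled by $C_\alpha$ is always sent to some $C_\beta$ with $\beta \leq \alpha$, can be verified exhaustively on a finite class. The sketch below is an illustration only: the four-point domain, the particular enumeration `concepts`, and the helper names are assumptions made for the example, and a finite enumeration replaces the minimal well-ordering of the proof.

```python
from itertools import product

def first_consistent_index(concepts, sample, target):
    """Index beta of the first concept agreeing with `target` on the sample."""
    points = set(sample)
    for beta, concept in enumerate(concepts):
        if concept & points == target & points:
            return beta
    raise ValueError("target does not belong to the class")

def image_indices(concepts, alpha, domain, n):
    """Indices of all concepts returned on n-samples labelled by C_alpha,
    a finite counterpart of the set in Eq. (8)."""
    return {first_consistent_index(concepts, sigma, concepts[alpha])
            for sigma in product(domain, repeat=n)}

domain = range(4)
# An arbitrary enumeration C_0, C_1, ... of a small class of subsets.
concepts = [frozenset(s) for s in ([], [0], [0, 1], [1, 2], [0, 1, 2, 3], [3])]

for alpha in range(len(concepts)):
    print(alpha, sorted(image_indices(concepts, alpha, domain, 2)))
```

Every printed index set is contained in $\{0, \ldots, \alpha\}$, because $C_\alpha$ itself is always consistent with its own labelling and so the search stops no later than at $\alpha$.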

Lemma 11 and Lemma 12 lead to the following result.

Theorem 13 (Assuming MA): Let $\mathscr{C}$ be a concept class
consisting of Borel measurable subsets of a standard Borel
domain $\Omega$, and let $\mathcal{P}$ be a family of probability measures on
$\Omega$. Suppose that every countable subclass of $\mathscr{C}$ is uniform
Glivenko-Cantelli with regard to $\mathcal{P}$. Then the concept class
$\mathscr{C}$ is PAC learnable under $\mathcal{P}$. In addition, there exists a
common sample complexity bound for countable subclasses
of $\mathscr{C}$, and any such bound gives a sample complexity bound
for PAC learnability of $\mathscr{C}$. ■

Finally, we can deduce our main result.

Proof of (2) $\Rightarrow$ (1) in Theorem 2:

The implication follows from Theorem 13 with $\mathcal{P} = P(\Omega)$
and the common complexity bound. ■



V. Conclusion 

As a footnote to the fundamental theorem of statistical
learning, we have proved that in the presence of a mild
set-theoretic axiom (Martin's Axiom), PAC learnability of a
concept class $\mathscr{C}$ is equivalent to finiteness of the VC dimension
of $\mathscr{C}$, without any extra assumptions on the measurability of
the class $\mathscr{C}$. The price to pay is giving up consistent PAC
learnability, as well as a constructive choice of a learning rule.

It would be interesting to know to what extent the results
remain true in the usual ZFC model of set theory. In
particular, can an example of a concept class $\mathscr{C}$ on a standard
Borel domain which has finite VC dimension and still is
not consistently PAC learnable be constructed without any
additional set-theoretic axioms?